feat: DeepSeek-V4-Pro perf recipes for GB300 / GB200 (1k/1k agg) #70
Merged: ishandhanani merged 4 commits into NVIDIA:main on Apr 24, 2026
Conversation
Adds eight SGLang recipes covering NVIDIA-verified DeepSeek-V4-Pro
(1.6T MXFP4 MoE) aggregated-serving configurations on Grace+Blackwell:
recipes/gb300-fp4/1k1k-dsv4/
agg-low-latency.yaml — TP=4 + MTP 3/4 (min TPOT)
agg-nomtp.yaml — TP=4 (baseline)
agg-balanced-tep.yaml — TP=4+DP=4 DeepEP + MTP 1/2
agg-max-tpt-tep.yaml — TP=4+DP=4 DeepEP (max TPS/GPU)
agg-2n-low-latency.yaml — TP=8 + MTP 3/4
agg-2n-nomtp.yaml — TP=8
recipes/gb200-fp4/1k1k-dsv4/
agg-2n-low-latency.yaml — TP=8 + MTP 3/4
agg-2n-nomtp.yaml — TP=8
Flag set derived from the SGLang DSv4 cookbook
(docs_new/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx):
* moe-runner-backend: flashinfer_mxfp4 (MXFP4 MoE on Blackwell)
* chunked-prefill-size: 4096 + disable-flashinfer-autotune: true
* EAGLE spec-decoding 3/4 for low-latency, 1/2 for balanced
* TEP recipes: enable-dp-attention + moe-a2a-backend: deepep +
deepep-config num_sms=96 (DEEPEP_LARGE_SMS_FLAG, single-node Blackwell)
* disable-radix-cache: true (synthetic bench best practice, also
reduces allocator fragmentation during MXFP4 weight-reorder)
* mem-fraction-static: 0.78 (0.82 intermittently OOMs GB300 during
reorder_w1w3_to_w3w1; 0.78 leaves contiguous headroom)
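For illustration, the flags above might combine into a recipe along these lines. This is a hypothetical sketch: the recipe schema keys (`model`, `container`, `server_args`) and the MTP 3/4 → speculative-flag mapping are assumptions, not the actual files under recipes/gb300-fp4/1k1k-dsv4/.

```yaml
# Hypothetical sketch of agg-low-latency.yaml; schema keys are assumed,
# server flags mirror the list above.
model: deepseek-v4                 # alias from srtslurm.yaml.example
container: dsv4-grace-blackwell
nodes: 1
server_args:
  tp-size: 4
  speculative-algorithm: EAGLE     # assumed mapping for "MTP 3/4"
  speculative-num-steps: 3
  speculative-num-draft-tokens: 4
  moe-runner-backend: flashinfer_mxfp4
  chunked-prefill-size: 4096
  disable-flashinfer-autotune: true
  disable-radix-cache: true
  mem-fraction-static: 0.78
```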
srtslurm.yaml.example: added deepseek-v4 model + container aliases.
Also adds README.md in each recipe subdir with rationale + reference
pointers to the SGLang cookbook and DSv4-Pro model card.
Made-with: Cursor
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           main     #70   +/- ##
=================================
  Coverage      ?  70.35%
  Files         ?      59
  Lines         ?    6270
  Branches      ?       0
=================================
  Hits          ?    4411
  Misses        ?    1859
  Partials      ?       0

View full report in Codecov by Sentry.
…ARMUP Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 24, 2026
YAMY1234 added a commit to YAMY1234/srt-slurm-upstream that referenced this pull request on Apr 24, 2026:
Adapted from @ishandhanani's dynamo-frontend disagg recipe. 1 prefill + 1 decode node at TP=4, MXFP4 FlashInfer MoE runner, NIXL transfer backend. Benchmark type is ``manual`` for external sweep control.

Lives under ``recipes/gb300-fp4/1k1k-dsv4/`` to match PR NVIDIA#70's hardware-partitioned layout. Uses the ``dsv4-pro`` checkpoint and ``dsv4-grace-blackwell`` container (the GB300 aarch64 sglang image), since DSV4-Flash weights are not staged and the original ``dsflash`` / ``dspro`` / ``nginx`` aliases from the upstream recipe are local to @ishandhanani's environment.

Dynamo hash pinned: 9d3c913d300eb368cda28b3f98a23a5762621e0d

Made-with: Cursor
ishandhanani pushed a commit that referenced this pull request on Apr 24, 2026:
* feat(sa-bench): add sglang DeepSeek-V4 tokenizer
Adds a client-side tokenizer for DeepSeek-V4-Pro that matches sglang
server behavior. Usable via the existing 'module.path.ClassName' hook
in backend_request_func.get_tokenizer; no changes to sa-bench itself.
Motivation: DeepSeek-V4 ships no HF chat template, so
tokenizer.apply_chat_template() raises ValueError. Sglang's server
replaces the HF path with a hard-coded DSML encoder
(encoding_dsv4.encode_messages) whenever arch == 'DeepseekV4ForCausalLM',
per sgl-project/sglang PR #23600. Without a matching client-side
encoder, sa-bench input_tokens diverges from server #new-token.
Implementation:
- sa_bench_tokenizers/_sglang_encoding_dsv4.py: vendored byte-exact
from sgl-project/sglang@f5d03db (Apache-2.0, 840 lines).
- sa_bench_tokenizers/sglang_deepseek_v4.py: HF-compatible wrapper.
apply_chat_template() mirrors serving_chat exactly:
1) insert empty system message if missing
2) thinking_mode='chat', reasoning_effort=None (defaults)
3) call encode_messages(...)
4) hf_tokenizer.encode(..., add_special_tokens=False)
Usage (recipe):
benchmark:
custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
Recipes: recipe YAML authoring is out of scope; see #70
for DeepSeek-V4-Pro sglang recipes.
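The four-step mirroring described above can be sketched in Python. This is a hedged illustration, not the vendored sa_bench_tokenizers code: the `encode_messages` and `hf_tokenizer` collaborators are passed in as parameters here, whereas the real wrapper holds them internally.

```python
# Hypothetical sketch of the wrapper's apply_chat_template; names and
# call shapes are assumptions based on the commit message above.
from typing import Any, Callable


def apply_chat_template(messages: list[dict[str, Any]],
                        hf_tokenizer: Any,
                        encode_messages: Callable[..., str]) -> list[int]:
    # 1) insert an empty system message if the conversation lacks one
    if not messages or messages[0].get("role") != "system":
        messages = [{"role": "system", "content": ""}] + messages
    # 2)+3) render with the vendored DSML encoder using server defaults
    prompt = encode_messages(messages, thinking_mode="chat",
                             reasoning_effort=None)
    # 4) tokenize without re-adding special tokens (the encoder emits them)
    return hf_tokenizer.encode(prompt, add_special_tokens=False)
```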
* fix(sa-bench): load DeepSeek-V4 via PreTrainedTokenizerFast first
AutoTokenizer.from_pretrained rejects checkpoints whose model_type
(deepseek_v4) is not yet registered in mainline transformers. The V4
checkpoint ships a ready-made tokenizer.json, so prefer loading it
directly through PreTrainedTokenizerFast and only fall back to
AutoTokenizer (for future transformers releases that register V4).
Verified offline on DeepSeek-V4-Pro: 4 representative prompts
(hello, GSM8K, system+user, multi-turn) each produce client token IDs
byte-identical to the sglang server path
tokenizer.encode(encode_messages(msgs_with_empty_system,
thinking_mode=chat)).
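The loading order described in this commit can be sketched as a small helper. A minimal sketch, assuming a standard transformers checkpoint layout; the function name is hypothetical and the import is deferred so the snippet stands alone.

```python
# Sketch of "PreTrainedTokenizerFast first, AutoTokenizer fallback";
# load_dsv4_tokenizer is a hypothetical name, not the actual sa-bench API.
def load_dsv4_tokenizer(checkpoint: str):
    from transformers import AutoTokenizer, PreTrainedTokenizerFast
    try:
        # the checkpoint ships a ready-made tokenizer.json, so the fast
        # path works even while model_type "deepseek_v4" is unregistered
        return PreTrainedTokenizerFast.from_pretrained(checkpoint)
    except Exception:
        # future transformers releases that register V4 take this path
        return AutoTokenizer.from_pretrained(checkpoint)
```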
* feat(recipes): add gb300 1-node agg smoke recipe for SGLang DeepSeek-V4 tokenizer
Minimal recipe that exercises the new SGLangDeepseekV4Tokenizer wrapper end
to end on a single GB300 node (TP=4, MTP 3/4, MXFP4 MoE). Used to verify
client-side prompt encoding aligns with the SGLang server for DeepSeek-V4.
Key knobs:
- mem-fraction-static: 0.82 (required on 1-node: DSv4-Pro weights occupy
~206 GB/GPU, so a lower mfs leaves negative KV pool after SGLang reserves
(1-mfs)*total_mem for activations, triggering "Not enough memory")
- use_chat_template: true
- custom_tokenizer: sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer
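The mem-fraction-static reasoning above can be made concrete with back-of-the-envelope arithmetic. The 288 GB/GPU HBM figure for GB300 is an assumption for illustration; the ~206 GB/GPU weight footprint comes from the commit message.

```python
# KV-pool headroom under sglang's mem-fraction-static (mfs): sglang
# reserves (1 - mfs) * total for activations, so weights plus KV pool
# must fit inside mfs * total.
TOTAL_GB = 288.0     # assumed GB300 HBM per GPU
WEIGHTS_GB = 206.0   # DSv4-Pro weight shards per GPU at TP=4 (from text)

def kv_pool_gb(mem_fraction_static: float) -> float:
    return mem_fraction_static * TOTAL_GB - WEIGHTS_GB

print(kv_pool_gb(0.82))  # 30.16 GB of KV pool
print(kv_pool_gb(0.78))  # 18.64 GB: lowering mfs shrinks the pool, and on
                         # tighter configs it can go negative ("Not enough memory")
```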
* chore: add NVIDIA SPDX copyright headers to sa_bench_tokenizers files
Required by NVIDIA/srt-slurm CI license check.
* feat(sa-bench): mirror SGLANG_ENABLE_THINKING / SGLANG_REASONING_EFFORT env fallback
The client-side DSV4 tokenizer renders prompts via the vendored
``encoding_dsv4.encode_messages`` and previously hard-coded
``thinking_mode="chat"`` and ``reasoning_effort=None``. This matches the
sglang server only when the user has not set the DSV4 thinking knobs.
sglang ``serving_chat.py`` (PR #23600) honors two envs as fallbacks:
- ``SGLANG_ENABLE_THINKING=1`` -> ``thinking_mode="thinking"``
- ``SGLANG_REASONING_EFFORT=max|high`` -> passed through to the encoder
Recipes typically set these in ``prefill_environment`` /
``decode_environment`` for reasoning eval workloads (gpqa, aime25, etc.)
on DeepSeek-V4-Flash. Without matching fallback on the sa-bench client,
server prompts would be wrapped in ``<think>...</think>`` while the
client still rendered the chat template, desynchronizing ISL / TPOT /
MTP accept-rate accounting.
This change:
- Adds ``_env_enable_thinking()`` / ``_env_reasoning_effort()`` helpers
that parse the envs the same way sglang ``EnvBool`` / ``EnvStr`` do
(``{1,true,yes,on}`` truthy; ``max|high`` filter).
- Changes ``apply_chat_template`` defaults from Python literals to
``None`` sentinels; when ``None``, falls back to env. Explicit caller
kwargs (incl. ``thinking=False``) still win, matching the server's
``(request.chat_template_kwargs or {}).get("thinking", env_default)``
precedence.
- Expands the docstring to show the real server call chain (not just
the happy-path defaults).
Smoke-tested all five precedence cases (no env + no kwarg / env on +
no kwarg / env on + explicit False / env off + explicit True / bogus
env value filtered).
Made-with: Cursor
This was referenced Apr 26, 2026
Based on #69